Hello, Jolene! Nice to see you again! :)
My name is Olga. I'm happy to be reviewing your project today.
The first time I see a mistake, I will just point it out and let you find and fix it yourself. In a real job, your boss will do the same, and I'm trying to prepare you to work as a Data Analyst. But if you can't handle the task yet, I will give you a more precise hint at the next check.
Below you will find my comments - please do not move, modify or delete them.
You can find my comments in green, yellow or red boxes like this:
You can answer me by using this:
Project Description
You work at a startup that sells food products. You need to investigate user behavior for the company's app.
Tasks
Important things to consider
Description of the data
Instructions for completing the project
Step 1. Open the data file and read the general information
Step 2. Prepare the data for analysis
Step 3. Study and check the data
Step 4. Study the event funnel
Step 5. Study the results of the experiment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
from plotly import graph_objects as go
import scipy.stats as st
import math as mth
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
sns.set_palette('bright')
pd.set_option('display.max_colwidth', 450)
Plan of Action
Open and save original data as a dataframe
# read and open original data as df
events_data_orig = pd.read_csv('datasets/logs_exp_us.csv', delimiter = "\t")
Examine general info
# review first few rows
events_data_orig.head(10)
| | EventName | DeviceIDHash | EventTimestamp | ExpId |
|---|---|---|---|---|
| 0 | MainScreenAppear | 4575588528974610257 | 1564029816 | 246 |
| 1 | MainScreenAppear | 7416695313311560658 | 1564053102 | 246 |
| 2 | PaymentScreenSuccessful | 3518123091307005509 | 1564054127 | 248 |
| 3 | CartScreenAppear | 3518123091307005509 | 1564054127 | 248 |
| 4 | PaymentScreenSuccessful | 6217807653094995999 | 1564055322 | 248 |
| 5 | CartScreenAppear | 6217807653094995999 | 1564055323 | 248 |
| 6 | OffersScreenAppear | 8351860793733343758 | 1564066242 | 246 |
| 7 | MainScreenAppear | 5682100281902512875 | 1564085677 | 246 |
| 8 | MainScreenAppear | 1850981295691852772 | 1564086702 | 247 |
| 9 | MainScreenAppear | 5407636962369102641 | 1564112112 | 246 |
events_data_orig.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244126 entries, 0 to 244125
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   EventName       244126 non-null  object
 1   DeviceIDHash    244126 non-null  int64
 2   EventTimestamp  244126 non-null  int64
 3   ExpId           244126 non-null  int64
dtypes: int64(3), object(1)
memory usage: 7.5+ MB
Take a closer look at the individual columns
events_data_orig['EventName'].value_counts(dropna = False)
MainScreenAppear           119205
OffersScreenAppear          46825
CartScreenAppear            42731
PaymentScreenSuccessful     34313
Tutorial                     1052
Name: EventName, dtype: int64
events_data_orig['ExpId'].value_counts(dropna = False)
248    85747
246    80304
247    78075
Name: ExpId, dtype: int64
events_data_orig['DeviceIDHash'].nunique()
7551
Actions Performed
Plan of Action
Read and save working version of data
events_data = pd.read_csv('datasets/logs_exp_us.csv', delimiter = "\t")
events_data.columns
Index(['EventName', 'DeviceIDHash', 'EventTimestamp', 'ExpId'], dtype='object')
events_data = events_data.rename(columns = {'EventName': 'event_name',
'DeviceIDHash': 'user_id',
'EventTimestamp': 'event_datetime',
'ExpId': 'exp_id'})
events_data.columns
Index(['event_name', 'user_id', 'event_datetime', 'exp_id'], dtype='object')
# identify missing values
print('number of missing values in dataframe')
events_data.isnull().sum()
number of missing values in dataframe
event_name        0
user_id           0
event_datetime    0
exp_id            0
dtype: int64
2.b Conclusion
# identify duplicate rows
num_duplicates = events_data.duplicated().sum()
print('{} duplicate rows'.format(num_duplicates))
413 duplicate rows
# examine duplicate rows
duplicate_rows = events_data[events_data.duplicated()]
duplicate_rows.head(20).sort_values(by = 'user_id')
| | event_name | user_id | event_datetime | exp_id |
|---|---|---|---|---|
| 3573 | MainScreenAppear | 434103746454591587 | 1564628377 | 248 |
| 2350 | CartScreenAppear | 1694940645335807244 | 1564609899 | 248 |
| 14333 | CartScreenAppear | 1807104407388801321 | 1564655427 | 248 |
| 9179 | MainScreenAppear | 2230705996155527339 | 1564646087 | 246 |
| 4803 | MainScreenAppear | 2835328739789306622 | 1564634641 | 248 |
| 15746 | PaymentScreenSuccessful | 2877433916175408776 | 1564658001 | 247 |
| 15751 | PaymentScreenSuccessful | 3528217211962170139 | 1564658003 | 247 |
| 15752 | PaymentScreenSuccessful | 3528217211962170139 | 1564658003 | 247 |
| 15753 | PaymentScreenSuccessful | 3528217211962170139 | 1564658003 | 247 |
| 4076 | MainScreenAppear | 3761373764179762633 | 1564631266 | 247 |
| 5641 | CartScreenAppear | 4248762472840564256 | 1564637764 | 248 |
| 12454 | PaymentScreenSuccessful | 5152160705477623487 | 1564652139 | 247 |
| 9311 | MainScreenAppear | 5496043151846125970 | 1564646283 | 248 |
| 453 | MainScreenAppear | 5613408041324010552 | 1564474784 | 248 |
| 13055 | PaymentScreenSuccessful | 6258460144399027762 | 1564653224 | 247 |
| 5875 | PaymentScreenSuccessful | 6427012997733591237 | 1564638452 | 248 |
| 9990 | PaymentScreenSuccessful | 7035352794299231933 | 1564647627 | 248 |
| 7249 | OffersScreenAppear | 7224691986599895551 | 1564641846 | 246 |
| 15304 | PaymentScreenSuccessful | 8125832085431322921 | 1564657270 | 246 |
| 8065 | CartScreenAppear | 8189122927585332969 | 1564643929 | 248 |
perc_duplicates = num_duplicates / len(events_data)
print('{:.2%} of the data are duplicates'.format(perc_duplicates))
0.17% of the data are duplicates
# duplicates per experiment
exp_dups = duplicate_rows['exp_id'].value_counts(dropna = False).reset_index()
exp_dups.columns = ['exp_id', 'num_dups']
exp_dups
| | exp_id | num_dups |
|---|---|---|
| 0 | 248 | 165 |
| 1 | 247 | 125 |
| 2 | 246 | 123 |
# total events per experiment
group_exp = events_data['exp_id'].value_counts(dropna = False).reset_index()
group_exp.columns = ['exp_id', 'total_exp']
group_exp
| | exp_id | total_exp |
|---|---|---|
| 0 | 248 | 85747 |
| 1 | 246 | 80304 |
| 2 | 247 | 78075 |
# combine tables
group_dup_perc = pd.merge(group_exp, exp_dups, on = 'exp_id')
group_dup_perc
| | exp_id | total_exp | num_dups |
|---|---|---|---|
| 0 | 248 | 85747 | 165 |
| 1 | 246 | 80304 | 123 |
| 2 | 247 | 78075 | 125 |
group_dup_perc['perc_dup'] = (group_dup_perc['num_dups'] / group_dup_perc['total_exp'] * 100).round(2)
group_dup_perc
| | exp_id | total_exp | num_dups | perc_dup |
|---|---|---|---|---|
| 0 | 248 | 85747 | 165 | 0.19 |
| 1 | 246 | 80304 | 123 | 0.15 |
| 2 | 247 | 78075 | 125 | 0.16 |
# delete duplicate rows and reset index
events_data = events_data.drop_duplicates().reset_index(drop = True)
# confirm duplicate rows are removed
print('number of duplicate rows after duplicates removed')
events_data.duplicated().sum()
number of duplicate rows after duplicates removed
0
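A minimal sketch on toy rows (not the project log) showing what `duplicated()` flags and what `drop_duplicates()` keeps:

```python
import pandas as pd

# toy log with one exact repeat of the Cart row
df = pd.DataFrame({
    'event': ['Main', 'Cart', 'Cart', 'Pay'],
    'user':  [1, 2, 2, 2],
    'ts':    [100, 200, 200, 300],
})

# duplicated() flags the second and later copies of a fully identical row
print(df.duplicated().sum())        # 1

# drop_duplicates() keeps the first copy; reset_index gives a clean 0..n index
clean = df.drop_duplicates().reset_index(drop=True)
print(len(clean))                   # 3
print(clean.duplicated().sum())     # 0
```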
# change unix timestamp to datetime
events_data['event_datetime'] = pd.to_datetime(events_data['event_datetime'], unit = 's')
events_data.head()
| | event_name | user_id | event_datetime | exp_id |
|---|---|---|---|---|
| 0 | MainScreenAppear | 4575588528974610257 | 2019-07-25 04:43:36 | 246 |
| 1 | MainScreenAppear | 7416695313311560658 | 2019-07-25 11:11:42 | 246 |
| 2 | PaymentScreenSuccessful | 3518123091307005509 | 2019-07-25 11:28:47 | 248 |
| 3 | CartScreenAppear | 3518123091307005509 | 2019-07-25 11:28:47 | 248 |
| 4 | PaymentScreenSuccessful | 6217807653094995999 | 2019-07-25 11:48:42 | 248 |
events_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243713 entries, 0 to 243712
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   event_name      243713 non-null  object
 1   user_id         243713 non-null  int64
 2   event_datetime  243713 non-null  datetime64[ns]
 3   exp_id          243713 non-null  int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 7.4+ MB
# keep only the calendar-date part (midnight-floored datetime)
events_data['event_date'] = events_data['event_datetime'].dt.normalize()
events_data['event_time'] = events_data['event_datetime'].dt.time
events_data.head()
| | event_name | user_id | event_datetime | exp_id | event_date | event_time |
|---|---|---|---|---|---|---|
| 0 | MainScreenAppear | 4575588528974610257 | 2019-07-25 04:43:36 | 246 | 2019-07-25 | 04:43:36 |
| 1 | MainScreenAppear | 7416695313311560658 | 2019-07-25 11:11:42 | 246 | 2019-07-25 | 11:11:42 |
| 2 | PaymentScreenSuccessful | 3518123091307005509 | 2019-07-25 11:28:47 | 248 | 2019-07-25 | 11:28:47 |
| 3 | CartScreenAppear | 3518123091307005509 | 2019-07-25 11:28:47 | 248 | 2019-07-25 | 11:28:47 |
| 4 | PaymentScreenSuccessful | 6217807653094995999 | 2019-07-25 11:48:42 | 248 | 2019-07-25 | 11:48:42 |
events_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243713 entries, 0 to 243712
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   event_name      243713 non-null  object
 1   user_id         243713 non-null  int64
 2   event_datetime  243713 non-null  datetime64[ns]
 3   exp_id          243713 non-null  int64
 4   event_date      243713 non-null  datetime64[ns]
 5   event_time      243713 non-null  object
dtypes: datetime64[ns](2), int64(2), object(2)
memory usage: 11.2+ MB
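The conversion above relies on `pd.to_datetime(..., unit='s')` interpreting the integers as seconds since the Unix epoch. A minimal sketch with two timestamps taken from the log's first rows:

```python
import pandas as pd

# two epoch-second timestamps from the first rows of the log
ts = pd.Series([1564029816, 1564053102])

# unit='s' interprets the integers as seconds since 1970-01-01 (UTC)
dt = pd.to_datetime(ts, unit='s')
print(dt.iloc[0])            # 2019-07-25 04:43:36

# split into a calendar date and a time of day
date = dt.dt.normalize()     # datetime floored to midnight
tod = dt.dt.time             # datetime.time objects
print(date.iloc[0].date(), tod.iloc[0])
```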
Actions Performed
# review data
events_data.head(2)
| | event_name | user_id | event_datetime | exp_id | event_date | event_time |
|---|---|---|---|---|---|---|
| 0 | MainScreenAppear | 4575588528974610257 | 2019-07-25 04:43:36 | 246 | 2019-07-25 | 04:43:36 |
| 1 | MainScreenAppear | 7416695313311560658 | 2019-07-25 11:11:42 | 246 | 2019-07-25 | 11:11:42 |
# calculate the number of entries
num_events = len(events_data)
print('There are {} events in the duplicate free data. \n'.format(num_events))
# create a table with counts of each experiments events
events_per_exp = events_data.pivot_table(index = 'exp_id', columns = 'event_name',
values = 'user_id', aggfunc = 'count',
margins = True, margins_name = 'TotalEvents')
print('Table: Number of events per experiment and number of event types per experiment')
events_per_exp
There are 243713 events in the duplicate free data.
Table: Number of events per experiment and number of event types per experiment
| exp_id | CartScreenAppear | MainScreenAppear | OffersScreenAppear | PaymentScreenSuccessful | Tutorial | TotalEvents |
|---|---|---|---|---|---|---|
| 246 | 14798 | 38249 | 14904 | 11912 | 318 | 80181 |
| 247 | 12548 | 39677 | 15341 | 10039 | 345 | 77950 |
| 248 | 15322 | 41175 | 16563 | 12167 | 355 | 85582 |
| TotalEvents | 42668 | 119101 | 46808 | 34118 | 1018 | 243713 |
# calculate the number of unique users
num_users = events_data['user_id'].nunique()
print('There are {} unique users in the duplicate free data.\n'.format(num_users))
# create a table of the number of users per group
users_per_exp = events_data.groupby('exp_id').agg({'user_id': 'nunique'}).reset_index()
users_per_exp.columns = ['exp_id', 'num_users']
print('Table: Number of users per experiment')
users_per_exp
There are 7551 unique users in the duplicate free data.
Table: Number of users per experiment
| | exp_id | num_users |
|---|---|---|
| 0 | 246 | 2489 |
| 1 | 247 | 2520 |
| 2 | 248 | 2542 |
# calculate the average number of events per user
num_events_per_user = events_data.groupby('user_id')['event_name'].count().mean().round(0)
print('The average number of events per user is {}'.format(num_events_per_user))
The average number of events per user is 32.0
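The chain above (group by user, count events, take the mean) can be sketched on a toy log:

```python
import pandas as pd

# toy log: user 1 triggered three events, user 2 two, user 3 one
log = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'event_name': ['Main', 'Cart', 'Pay', 'Main', 'Main', 'Main'],
})

# events per user, then the average across users
per_user = log.groupby('user_id')['event_name'].count()
print(per_user.tolist())     # [3, 2, 1]
print(per_user.mean())       # 2.0
```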
3.d What period of time does the data cover?
# find the min and max dates
min_date = events_data['event_date'].min().date()
max_date = events_data['event_date'].max().date()
print('The data covers dates from {} to {}.'.format(min_date, max_date))
The data covers dates from 2019-07-25 to 2019-08-07.
# find the number of events per date
events_per_date = events_data.groupby('event_date').agg({'exp_id': 'count'}).reset_index()
events_per_date.columns = ['event_date', 'num_events_per_date']
events_per_date
| | event_date | num_events_per_date |
|---|---|---|
| 0 | 2019-07-25 | 9 |
| 1 | 2019-07-26 | 31 |
| 2 | 2019-07-27 | 55 |
| 3 | 2019-07-28 | 105 |
| 4 | 2019-07-29 | 184 |
| 5 | 2019-07-30 | 412 |
| 6 | 2019-07-31 | 2030 |
| 7 | 2019-08-01 | 36141 |
| 8 | 2019-08-02 | 35554 |
| 9 | 2019-08-03 | 33282 |
| 10 | 2019-08-04 | 32968 |
| 11 | 2019-08-05 | 36058 |
| 12 | 2019-08-06 | 35788 |
| 13 | 2019-08-07 | 31096 |
# plot the results
plt.figure(figsize = (12, 6))
events_per_date_graph = sns.barplot(data = events_per_date, x = events_per_date['event_date'].dt.date, y = 'num_events_per_date')
plt.title('Number of events per date', size = 15)
plt.xlabel('date', size = 15)
plt.xticks(rotation = 60)
plt.ylabel('number of events', size = 15)
for bar in events_per_date_graph.patches:
    events_per_date_graph.annotate(format(bar.get_height(), '.0f'),
                                   xy = (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                                   ha = 'center',
                                   va = 'center',
                                   xytext = (0, 7),
                                   textcoords = 'offset points')
plt.show()
3.d.4 What period does the data actually represent?
# create a new table excluding the dates with minimal data
events_filtered = events_data.query('event_date > "2019-07-31"').reset_index(drop = True)
events_filtered['event_date'].value_counts()
2019-08-01    36141
2019-08-05    36058
2019-08-06    35788
2019-08-02    35554
2019-08-03    33282
2019-08-04    32968
2019-08-07    31096
Name: event_date, dtype: int64
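The `query('event_date > "2019-07-31"')` filter works because pandas parses the string literal as a timestamp when the column is datetime64; a minimal sketch with toy dates:

```python
import pandas as pd

df = pd.DataFrame({
    'event_date': pd.to_datetime(['2019-07-30', '2019-07-31', '2019-08-01', '2019-08-02']),
    'n': [1, 2, 3, 4],
})

# string literals in query() are compared as timestamps for datetime64 columns
kept = df.query('event_date > "2019-07-31"').reset_index(drop=True)
print(len(kept))            # 2
print(kept['n'].tolist())   # [3, 4]
```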
# filtered data
# calculate the number of entries
filtered_num_events = len(events_filtered)
print('There are {} events in the filtered data. \n'.format(filtered_num_events))
# create a table with counts of each experiments events
filtered_events_per_exp = events_filtered.pivot_table(index = 'exp_id', columns = 'event_name',
values = 'user_id', aggfunc = 'count',
margins = True, margins_name = 'TotalEventsFiltered')
print('Table: Number of events per experiment and number of event types per experiment in filtered data')
filtered_events_per_exp
There are 240887 events in the filtered data.
Table: Number of events per experiment and number of event types per experiment in filtered data
| exp_id | CartScreenAppear | MainScreenAppear | OffersScreenAppear | PaymentScreenSuccessful | Tutorial | TotalEventsFiltered |
|---|---|---|---|---|---|---|
| 246 | 14690 | 37676 | 14767 | 11852 | 317 | 79302 |
| 247 | 12434 | 39090 | 15179 | 9981 | 338 | 77022 |
| 248 | 15179 | 40562 | 16387 | 12085 | 350 | 84563 |
| TotalEventsFiltered | 42303 | 117328 | 46333 | 33918 | 1005 | 240887 |
# unfiltered data
print('Table: Number of events per experiment and number of event types per experiment (unfiltered data)')
events_per_exp
Table: Number of events per experiment and number of event types per experiment (unfiltered data)
| exp_id | CartScreenAppear | MainScreenAppear | OffersScreenAppear | PaymentScreenSuccessful | Tutorial | TotalEvents |
|---|---|---|---|---|---|---|
| 246 | 14798 | 38249 | 14904 | 11912 | 318 | 80181 |
| 247 | 12548 | 39677 | 15341 | 10039 | 345 | 77950 |
| 248 | 15322 | 41175 | 16563 | 12167 | 355 | 85582 |
| TotalEvents | 42668 | 119101 | 46808 | 34118 | 1018 | 243713 |
#calculate the difference between filtered and unfiltered events
event_diff = len(events_data) - filtered_num_events
print('The unfiltered data has {} entries, the filtered data has {} entries.\nThis is a difference of {} entries and accounts for {:.1%} of the unfiltered data.'.format(len(events_data), len(events_filtered), event_diff, (event_diff / len(events_data))))
The unfiltered data has 243713 entries, the filtered data has 240887 entries.
This is a difference of 2826 entries and accounts for 1.2% of the unfiltered data.
Did you lose many events and users when excluding the older data?
# calculate the number of unique users in the filtered data
filtered_num_users = events_filtered['user_id'].nunique()
print('There are {} unique users in the filtered data.\n'.format(filtered_num_users))
# create a table of the number of users per group in the filtered data
filtered_users_per_exp = events_filtered.groupby('exp_id').agg({'user_id': 'nunique'}).reset_index()
filtered_users_per_exp.columns = ['exp_id', 'filtered_num_users']
print('Table: Number of users per experiment in filtered data')
filtered_users_per_exp
There are 7534 unique users in the filtered data.
Table: Number of users per experiment in filtered data
| | exp_id | filtered_num_users |
|---|---|---|
| 0 | 246 | 2484 |
| 1 | 247 | 2513 |
| 2 | 248 | 2537 |
# unfiltered data
print('There are {} unique users in the duplicate free data.\n'.format(num_users))
print('Table: Number of users per experiment (unfiltered data)')
users_per_exp
There are 7551 unique users in the duplicate free data.
Table: Number of users per experiment (unfiltered data)
| | exp_id | num_users |
|---|---|---|
| 0 | 246 | 2489 |
| 1 | 247 | 2520 |
| 2 | 248 | 2542 |
# determine the percent of users filtered out per experiment
filtered_users_per_exp['perc_removed_users'] = ((1 - (filtered_users_per_exp['filtered_num_users'] / users_per_exp['num_users'])) * 100).round(2)
filtered_users_per_exp
| | exp_id | filtered_num_users | perc_removed_users |
|---|---|---|---|
| 0 | 246 | 2484 | 0.20 |
| 1 | 247 | 2513 | 0.28 |
| 2 | 248 | 2537 | 0.20 |
user_diff = num_users - filtered_num_users
print('The unfiltered data has {} number of users, the filtered data has {}.\nThis is a difference of {} and accounts for {:.1%} of the unfiltered users'.format(num_users, filtered_num_users, user_diff, (user_diff / num_users)))
print('The filtered users per experiment group account for {} percent per group'.format(list(filtered_users_per_exp['perc_removed_users'])))
The unfiltered data has 7551 number of users, the filtered data has 7534. This is a difference of 17 and accounts for 0.2% of the unfiltered users The filtered users per experiment group account for [0.2, 0.28, 0.2] percent per group
Make sure you have users from all three experimental groups.
After removing duplicates
After removing insufficient data and focusing on events from 2019-08-01 to 2019-08-07
# review data
events_filtered.head(2)
| | event_name | user_id | event_datetime | exp_id | event_date | event_time |
|---|---|---|---|---|---|---|
| 0 | Tutorial | 3737462046622621720 | 2019-08-01 00:07:28 | 246 | 2019-08-01 | 00:07:28 |
| 1 | MainScreenAppear | 3737462046622621720 | 2019-08-01 00:08:00 | 246 | 2019-08-01 | 00:08:00 |
# find the number of events per type
event_counts = events_filtered.groupby('event_name').agg({'exp_id': 'count'}).reset_index()
event_counts.columns = ['event_name', 'num_events_per_type']
event_counts
| | event_name | num_events_per_type |
|---|---|---|
| 0 | CartScreenAppear | 42303 |
| 1 | MainScreenAppear | 117328 |
| 2 | OffersScreenAppear | 46333 |
| 3 | PaymentScreenSuccessful | 33918 |
| 4 | Tutorial | 1005 |
# find the percentages of each event type
event_counts['perc'] = (event_counts['num_events_per_type'] / len(events_filtered) * 100).round(2)
event_counts.sort_values(by = 'num_events_per_type', ascending = False)
| | event_name | num_events_per_type | perc |
|---|---|---|---|
| 1 | MainScreenAppear | 117328 | 48.71 |
| 2 | OffersScreenAppear | 46333 | 19.23 |
| 0 | CartScreenAppear | 42303 | 17.56 |
| 3 | PaymentScreenSuccessful | 33918 | 14.08 |
| 4 | Tutorial | 1005 | 0.42 |
# plot the results
plt.figure(figsize = (12, 6))
event_counts_graph = sns.barplot(data = event_counts, x = 'event_name', y = 'num_events_per_type',
order = event_counts.sort_values(by = 'num_events_per_type', ascending = False).event_name)
plt.title('Number of Events per Type', size = 15)
plt.xlabel('event type', size = 15)
plt.ylabel('number of events')
plt.xticks(rotation = 30)
for bar in event_counts_graph.patches:
    event_counts_graph.annotate(format(bar.get_height(), '.0f'),
                                xy = (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                                ha = 'center',
                                va = 'center',
                                xytext = (0, 7),
                                textcoords = 'offset points')
plt.show()
See what events are in the logs and their frequency of occurrence. Sort them by frequency.
# review data
events_filtered.head(2)
| | event_name | user_id | event_datetime | exp_id | event_date | event_time |
|---|---|---|---|---|---|---|
| 0 | Tutorial | 3737462046622621720 | 2019-08-01 00:07:28 | 246 | 2019-08-01 | 00:07:28 |
| 1 | MainScreenAppear | 3737462046622621720 | 2019-08-01 00:08:00 | 246 | 2019-08-01 | 00:08:00 |
# create a table to find the number of unique users per event
users_per_event = events_filtered.groupby('event_name').agg({'user_id': 'nunique'}).reset_index()
users_per_event.columns = ['event_name', 'num_unique_users']
users_per_event['perc_users'] = (users_per_event['num_unique_users'] / events_filtered['user_id'].nunique() * 100).round()
users_per_event.sort_values(by = 'num_unique_users')
| | event_name | num_unique_users | perc_users |
|---|---|---|---|
| 4 | Tutorial | 840 | 11.0 |
| 3 | PaymentScreenSuccessful | 3539 | 47.0 |
| 0 | CartScreenAppear | 3734 | 50.0 |
| 2 | OffersScreenAppear | 4593 | 61.0 |
| 1 | MainScreenAppear | 7419 | 98.0 |
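The event counts in 3.b used `count` (every log row), while this table uses `nunique` (distinct users); a toy sketch of the difference on a hypothetical four-row log:

```python
import pandas as pd

# toy log: user 1 opened the main screen twice, user 2 once; only user 1 reached the cart
log = pd.DataFrame({
    'event_name': ['Main', 'Main', 'Main', 'Cart'],
    'user_id':    [1, 1, 2, 1],
})

# 'count' tallies events; 'nunique' tallies distinct users
summary = log.groupby('event_name')['user_id'].agg(['count', 'nunique'])
print(summary.loc['Main'].tolist())   # [3, 2]
print(summary.loc['Cart'].tolist())   # [1, 1]
```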
# plot the counts
plt.figure(figsize = (12, 6))
users_per_event_graph = sns.barplot(data = users_per_event, x = 'event_name', y = 'num_unique_users',
order = users_per_event.sort_values(by = 'num_unique_users', ascending = False).event_name)
plt.title('Number of Unique Users per Event Type', size = 15)
plt.xlabel('event type', size = 15)
plt.ylabel('number of unique users')
plt.xticks(rotation = 30)
for bar in users_per_event_graph.patches:
    users_per_event_graph.annotate(format(bar.get_height(), '.0f'),
                                   xy = (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                                   ha = 'center',
                                   va = 'center',
                                   xytext = (0, 7),
                                   textcoords = 'offset points')
plt.show()
Find the number of users who performed each of these actions. Sort the events by the number of users.
# review data
events_filtered.head(2)
| | event_name | user_id | event_datetime | exp_id | event_date | event_time |
|---|---|---|---|---|---|---|
| 0 | Tutorial | 3737462046622621720 | 2019-08-01 00:07:28 | 246 | 2019-08-01 | 00:07:28 |
| 1 | MainScreenAppear | 3737462046622621720 | 2019-08-01 00:08:00 | 246 | 2019-08-01 | 00:08:00 |
# determine what actions a user performed at least once
user_event_breakdown = events_filtered.groupby(['user_id', 'event_name']).agg({'event_date': 'nunique'})
user_event_breakdown.head(15)
| user_id | event_name | event_date |
|---|---|---|
| 6888746892508752 | MainScreenAppear | 1 |
| 6909561520679493 | CartScreenAppear | 1 |
| | MainScreenAppear | 1 |
| | OffersScreenAppear | 1 |
| | PaymentScreenSuccessful | 1 |
| 6922444491712477 | CartScreenAppear | 3 |
| | MainScreenAppear | 3 |
| | OffersScreenAppear | 3 |
| | PaymentScreenSuccessful | 3 |
| 7435777799948366 | MainScreenAppear | 2 |
| 7702139951469979 | CartScreenAppear | 4 |
| | MainScreenAppear | 7 |
| | OffersScreenAppear | 7 |
| | PaymentScreenSuccessful | 4 |
| 8486814028069281 | CartScreenAppear | 1 |
# count the number of actions performed by an individual user
user_actions = events_filtered.groupby('user_id').agg({'event_name': 'nunique'}).reset_index()
user_actions.columns = ['user_id', 'num_actions_performed']
user_actions.head()
| | user_id | num_actions_performed |
|---|---|---|
| 0 | 6888746892508752 | 1 |
| 1 | 6909561520679493 | 4 |
| 2 | 6922444491712477 | 4 |
| 3 | 7435777799948366 | 1 |
| 4 | 7702139951469979 | 4 |
# count the number of users that performed the actions
action_proportions = user_actions.groupby('num_actions_performed').agg({'user_id': 'count'}).reset_index()
action_proportions.columns = ['num_actions_performed', 'num_users']
action_proportions
| | num_actions_performed | num_users |
|---|---|---|
| 0 | 1 | 2717 |
| 1 | 2 | 1004 |
| 2 | 3 | 318 |
| 3 | 4 | 3029 |
| 4 | 5 | 466 |
# create a new column with proportions of users who performed the actions
action_proportions['percentage_performed'] = (action_proportions['num_users'] / len(user_actions) * 100).round(2)
action_proportions
| | num_actions_performed | num_users | percentage_performed |
|---|---|---|---|
| 0 | 1 | 2717 | 36.06 |
| 1 | 2 | 1004 | 13.33 |
| 2 | 3 | 318 | 4.22 |
| 3 | 4 | 3029 | 40.20 |
| 4 | 5 | 466 | 6.19 |
# plot the results
plt.figure(figsize = (7, 7))
plt.pie(x = action_proportions['percentage_performed'], labels = action_proportions['num_actions_performed'],
autopct = '%.0f%%', textprops={'fontsize': 15})
plt.title('Percentage of Number of Actions Performed by Individual Users', size = 15)
plt.show()
Calculate the proportion of users who performed the action at least once.
Percentage of Number of Actions Performed by Individual Users
In what order do you think the actions took place? Are all of them part of a single sequence? You don't need to take them into account when calculating the funnel.
Based on the Number of Events per Type and Number of Unique Users per Event Type graphs, the order of actions is likely: MainScreenAppear → OffersScreenAppear → CartScreenAppear → PaymentScreenSuccessful.
While there may be a standard pattern followed by most users, I don't think the events must happen in this order.
# create a table focusing on required funnel
funnel = users_per_event.query('event_name != "Tutorial"').copy()
funnel
| | event_name | num_unique_users | perc_users |
|---|---|---|---|
| 0 | CartScreenAppear | 3734 | 50.0 |
| 1 | MainScreenAppear | 7419 | 98.0 |
| 2 | OffersScreenAppear | 4593 | 61.0 |
| 3 | PaymentScreenSuccessful | 3539 | 47.0 |
# order the users per event
funnel.sort_values(by = 'perc_users', ascending = False).reset_index(drop = True)
| | event_name | num_unique_users | perc_users |
|---|---|---|---|
| 0 | MainScreenAppear | 7419 | 98.0 |
| 1 | OffersScreenAppear | 4593 | 61.0 |
| 2 | CartScreenAppear | 3734 | 50.0 |
| 3 | PaymentScreenSuccessful | 3539 | 47.0 |
funnel['perc_change'] = round(funnel['num_unique_users'].sort_values(ascending = False).pct_change(periods = 1) * 100, 0)
funnel = funnel.fillna(0)
funnel.sort_values(by = 'num_unique_users', ascending = False)
| | event_name | num_unique_users | perc_users | perc_change |
|---|---|---|---|---|
| 1 | MainScreenAppear | 7419 | 98.0 | 0.0 |
| 2 | OffersScreenAppear | 4593 | 61.0 | -38.0 |
| 0 | CartScreenAppear | 3734 | 50.0 | -19.0 |
| 3 | PaymentScreenSuccessful | 3539 | 47.0 | -5.0 |
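The `perc_change` column above is the step-to-step drop; the retention between consecutive stages can be recomputed directly from the counts in the table:

```python
# unique users per funnel stage, taken from the table above
stages = {
    'MainScreenAppear':        7419,
    'OffersScreenAppear':      4593,
    'CartScreenAppear':        3734,
    'PaymentScreenSuccessful': 3539,
}

names = list(stages.keys())
counts = list(stages.values())

# share of users retained from each stage to the next
for i in range(1, len(counts)):
    print('{} -> {}: {:.0%}'.format(names[i - 1], names[i], counts[i] / counts[i - 1]))

# overall conversion from the first screen to payment
print('overall: {:.0%}'.format(counts[-1] / counts[0]))
```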
Use the event funnel to find the share of users that proceed from each stage to the next. (For instance, for the sequence of events A → B → C, calculate the ratio of users at stage B to the number of users at stage A and the ratio of users at stage C to the number at stage B.)
# reset index with sorted values
funnel_sort = funnel.sort_values(by = 'num_unique_users', ascending = False).reset_index()
funnel_sort
| | index | event_name | num_unique_users | perc_users | perc_change |
|---|---|---|---|---|---|
| 0 | 1 | MainScreenAppear | 7419 | 98.0 | 0.0 |
| 1 | 2 | OffersScreenAppear | 4593 | 61.0 | -38.0 |
| 2 | 0 | CartScreenAppear | 3734 | 50.0 | -19.0 |
| 3 | 3 | PaymentScreenSuccessful | 3539 | 47.0 | -5.0 |
# plot a funnel chart
funnel_chart = go.Figure(go.Funnel(y = list(funnel_sort['event_name']),
x = list(funnel_sort['num_unique_users'])))
funnel_chart.update_layout(title = 'Sales/Event Funnel Chart')
funnel_chart.show()
At what stage do you lose the most users?
What share of users make the entire journey from their first event to payment?
# review data
events_filtered.head(2)
| | event_name | user_id | event_datetime | exp_id | event_date | event_time |
|---|---|---|---|---|---|---|
| 0 | Tutorial | 3737462046622621720 | 2019-08-01 00:07:28 | 246 | 2019-08-01 | 00:07:28 |
| 1 | MainScreenAppear | 3737462046622621720 | 2019-08-01 00:08:00 | 246 | 2019-08-01 | 00:08:00 |
# count the number of users per group
exp_groups = events_filtered.groupby('exp_id').agg({'user_id': 'nunique'}).reset_index()
exp_groups.columns = ['exp_id', 'users_per_group']
exp_groups
| | exp_id | users_per_group |
|---|---|---|
| 0 | 246 | 2484 |
| 1 | 247 | 2513 |
| 2 | 248 | 2537 |
exp_perc_diff = 1 - exp_groups['users_per_group'].min() / exp_groups['users_per_group'].max()
print('There is a {:.1%} difference between exp 246 (least users) and exp 248 (most users)'.format(exp_perc_diff))
There is a 2.1% difference between exp 246 (least users) and exp 248 (most users)
How many users are there in each group?
5.b Determine if there is a statistically significant difference between sample groups
# determine if there are users in more than one group
users_test = events_filtered.groupby('user_id').agg({'exp_id': 'nunique'})
users_test.head()
| user_id | exp_id |
|---|---|
| 6888746892508752 | 1 |
| 6909561520679493 | 1 |
| 6922444491712477 | 1 |
| 7435777799948366 | 1 |
| 7702139951469979 | 1 |
users_query = users_test.query('exp_id > 1')
users_query
| user_id | exp_id |
|---|---|
# determine the percent difference between groups
exp_groups['perc_diff'] = round(exp_groups['users_per_group'].pct_change(periods = 1) * 100, 3)
exp_groups
| | exp_id | users_per_group | perc_diff |
|---|---|---|---|
| 0 | 246 | 2484 | NaN |
| 1 | 247 | 2513 | 1.167 |
| 2 | 248 | 2537 | 0.955 |
# prepare a table for hypothesis testing the experimental groups and combinations
experiments = events_filtered.pivot_table(index = 'event_name', columns = 'exp_id',
values = 'user_id', aggfunc = 'nunique').reset_index()
experiments.columns = ['event_name', '246', '247', '248']
experiments
| | event_name | 246 | 247 | 248 |
|---|---|---|---|---|
| 0 | CartScreenAppear | 1266 | 1238 | 1230 |
| 1 | MainScreenAppear | 2450 | 2476 | 2493 |
| 2 | OffersScreenAppear | 1542 | 1520 | 1531 |
| 3 | PaymentScreenSuccessful | 1200 | 1158 | 1181 |
| 4 | Tutorial | 278 | 283 | 279 |
# create a column with the combination of the two control groups
experiments['249'] = experiments['246'] + experiments['247']
experiments
| | event_name | 246 | 247 | 248 | 249 |
|---|---|---|---|---|---|
| 0 | CartScreenAppear | 1266 | 1238 | 1230 | 2504 |
| 1 | MainScreenAppear | 2450 | 2476 | 2493 | 4926 |
| 2 | OffersScreenAppear | 1542 | 1520 | 1531 | 3062 |
| 3 | PaymentScreenSuccessful | 1200 | 1158 | 1181 | 2358 |
| 4 | Tutorial | 278 | 283 | 279 | 561 |
# create a new table to combine the count of users from the control groups
# count the number of users per group
exp_groups_counts = events_filtered.groupby('exp_id').agg({'user_id': 'nunique'}).reset_index()
exp_groups_counts.columns = ['exp_id', 'users_per_group']
exp_groups_counts
| | exp_id | users_per_group |
|---|---|---|
| 0 | 246 | 2484 |
| 1 | 247 | 2513 |
| 2 | 248 | 2537 |
# create a dictionary with the count of the control groups users combined
controls_combo = {'exp_id': 249,
'users_per_group': exp_groups_counts[exp_groups_counts['exp_id'] == 246]['users_per_group'].iloc[0] +
exp_groups_counts[exp_groups_counts['exp_id'] == 247]['users_per_group'].iloc[0]}
controls_combo
{'exp_id': 249, 'users_per_group': 4997}
# combine into one dataframe
exp_counts_combo = pd.concat([exp_groups_counts, pd.DataFrame([controls_combo])], ignore_index = True)
exp_counts_combo
| | exp_id | users_per_group |
|---|---|---|
| 0 | 246 | 2484 |
| 1 | 247 | 2513 |
| 2 | 248 | 2537 |
| 3 | 249 | 4997 |
Testing the Hypothesis that Proportions Are Equal - Lesson
In stats we learned about hypothesis testing of the means of populations
- made conclusions based on samples comparing their means w/ a certain number or seeing if 2 means are equal to each other
Another typical task is to test hypotheses about the equality of proportions of populations
The difference between the proportions we observe in our samples will be our STATISTIC
Z = ((P1 - P2) - (π₁ - π₂)) / sqrt(P(1 - P)(1/n1 + 1/n2)) ~ N(0,1)
Z = standard value for a criterion with a standard normal distribution, where the mean is 0 and the standard deviation is 1; the expression is distributed as N(0,1)
n1, n2 = sizes of the two samples being compared, i.e. the number of observations they contain
P1, P2 = proportions observed in the samples
P = (P1·n1 + P2·n2) / (n1 + n2), the pooled proportion (combined successes over combined observations)
π₁, π₂ = the actual proportions in the populations being compared
With A/B testing, one usually tests the hypothesis that π₁ = π₂. Then, if the null hypothesis is true, the expression (π₁ - π₂) in the numerator will equal 0 and the criterion can be calculated using only the sample data. The statistic thus obtained will be normally distributed, making it possible to carry out two-sided and one-sided (bilateral and unilateral) tests. Using the same null hypothesis that two populations' proportions are equal, we can test the alternative hypotheses that 1) the proportions simply aren't equal, or that 2) one proportion is larger or smaller than the other.
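As a numeric sanity check of the formula, here is the same arithmetic for a single comparison, using the CartScreenAppear counts for groups 246 and 247 from the tables above; the standard library's `statistics.NormalDist` stands in for `st.norm(0, 1)`:

```python
import math
from statistics import NormalDist

# one comparison: CartScreenAppear users in groups 246 vs 247
# (successes and group sizes taken from the tables above)
x1, n1 = 1266, 2484
x2, n2 = 1238, 2513

p1, p2 = x1 / n1, x2 / n2
# pooled proportion: combined successes over combined observations
p_pooled = (x1 + x2) / (n1 + n2)

# z-statistic for the difference in proportions under H0: pi1 = pi2
z = (p1 - p2) / math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
# two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 3), round(p_value, 3))   # roughly 1.203 and 0.229
```

A p-value this large means the difference between the two control groups' Cart proportions is well within random variation.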
# create a function for testing the hypotheses
def test_hypothesis(exp_a, exp_b, alpha):
    # iterate over each event
    for event in experiments['event_name']:
        # define the successes, i.e. the number of users who performed the action
        successes_a = experiments[experiments['event_name'] == event][exp_a].iloc[0]
        successes_b = experiments[experiments['event_name'] == event][exp_b].iloc[0]
        # define the trials, i.e. the number of users in each experiment group
        exp_group_a = exp_counts_combo[exp_counts_combo['exp_id'] == int(exp_a)]['users_per_group'].iloc[0]
        exp_group_b = exp_counts_combo[exp_counts_combo['exp_id'] == int(exp_b)]['users_per_group'].iloc[0]
        # proportion of successes in experiment a
        p1 = successes_a / exp_group_a
        # proportion of successes in experiment b
        p2 = successes_b / exp_group_b
        # pooled success proportion across both groups
        p_combined = (successes_a + successes_b) / (exp_group_a + exp_group_b)
        # difference between the proportions
        difference = p1 - p2
        # z statistic; under the null hypothesis it follows the standard
        # normal distribution N(0, 1) (mean 0, standard deviation 1)
        z = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/exp_group_a + 1/exp_group_b))
        # set up the standard normal distribution (mean 0, standard deviation 1)
        distr = st.norm(0, 1)
        # calculate the two-sided p-value
        p_value = (1 - distr.cdf(abs(z))) * 2
        print('Event: {}:'.format(event))
        print('\tp-value: ', p_value)
        if p_value < alpha:
            print('\tReject the null hypothesis for {}, there is a significant difference between the proportions of experiments {} and {}.'.format(event, exp_a, exp_b))
        else:
            print('\tFail to reject the null hypothesis for {}, there is not a significant difference between the proportions of experiments {} and {}.'.format(event, exp_a, exp_b))
        # print the number of users and share of users per event for each experiment group
        print('\tExperiment:\t\t{}\t{}\n\tNumber of users:\t{}\t{}\n\tShare of users:\t\t{:.2%}\t{:.2%}'.format(exp_a, exp_b, exp_group_a, exp_group_b, p1, p2), '\n')
Control group 246 vs control group 247
# test control experiments
print('control group 246 vs control group 247\n')
test_hypothesis(exp_a = '246', exp_b = '247', alpha = 0.01)
control group 246 vs control group 247

Event: CartScreenAppear:
    p-value:  0.22883372237997213
    Fail to reject the null hypothesis for CartScreenAppear, there is not a significant difference between the proportions of experiments 246 and 247.
    Experiment:          246      247
    Number of users:     2484     2513
    Share of users:      50.97%   49.26%

Event: MainScreenAppear:
    p-value:  0.7570597232046099
    Fail to reject the null hypothesis for MainScreenAppear, there is not a significant difference between the proportions of experiments 246 and 247.
    Experiment:          246      247
    Number of users:     2484     2513
    Share of users:      98.63%   98.53%

Event: OffersScreenAppear:
    p-value:  0.2480954578522181
    Fail to reject the null hypothesis for OffersScreenAppear, there is not a significant difference between the proportions of experiments 246 and 247.
    Experiment:          246      247
    Number of users:     2484     2513
    Share of users:      62.08%   60.49%

Event: PaymentScreenSuccessful:
    p-value:  0.11456679313141849
    Fail to reject the null hypothesis for PaymentScreenSuccessful, there is not a significant difference between the proportions of experiments 246 and 247.
    Experiment:          246      247
    Number of users:     2484     2513
    Share of users:      48.31%   46.08%

Event: Tutorial:
    p-value:  0.9376996189257114
    Fail to reject the null hypothesis for Tutorial, there is not a significant difference between the proportions of experiments 246 and 247.
    Experiment:          246      247
    Number of users:     2484     2513
    Share of users:      11.19%   11.26%
Control Group 246 vs Test Group 248
# test control group vs test group
print('control group 246 vs test group 248\n')
test_hypothesis(exp_a = '246', exp_b = '248', alpha = 0.05)
control group 246 vs test group 248

Event: CartScreenAppear:
    p-value:  0.07842923237520116
    Fail to reject the null hypothesis for CartScreenAppear, there is not a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      50.97%   48.48%

Event: MainScreenAppear:
    p-value:  0.2949721933554552
    Fail to reject the null hypothesis for MainScreenAppear, there is not a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      98.63%   98.27%

Event: OffersScreenAppear:
    p-value:  0.20836205402738917
    Fail to reject the null hypothesis for OffersScreenAppear, there is not a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      62.08%   60.35%

Event: PaymentScreenSuccessful:
    p-value:  0.2122553275697796
    Fail to reject the null hypothesis for PaymentScreenSuccessful, there is not a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      48.31%   46.55%

Event: Tutorial:
    p-value:  0.8264294010087645
    Fail to reject the null hypothesis for Tutorial, there is not a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      11.19%   11.00%
Control Group 246 vs Test Group 248 Conclusion
Control Group 247 vs Test Group 248
# test control group vs test group
print('control group 247 vs test group 248\n')
test_hypothesis(exp_a = '247', exp_b = '248', alpha = 0.05)
control group 247 vs test group 248

Event: CartScreenAppear:
    p-value:  0.5786197879539783
    Fail to reject the null hypothesis for CartScreenAppear, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      49.26%   48.48%

Event: MainScreenAppear:
    p-value:  0.4587053616621515
    Fail to reject the null hypothesis for MainScreenAppear, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      98.53%   98.27%

Event: OffersScreenAppear:
    p-value:  0.9197817830592261
    Fail to reject the null hypothesis for OffersScreenAppear, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      60.49%   60.35%

Event: PaymentScreenSuccessful:
    p-value:  0.7373415053803964
    Fail to reject the null hypothesis for PaymentScreenSuccessful, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      46.08%   46.55%

Event: Tutorial:
    p-value:  0.765323922474501
    Fail to reject the null hypothesis for Tutorial, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      11.26%   11.00%
Control Group 247 vs Test Group 248 Conclusion
Combined Control Group 249 (246 + 247) vs Test Group 248
# test combined control group vs test group
print('control groups combined 249 (246 + 247) vs test group 248\n')
test_hypothesis(exp_a = '249', exp_b = '248', alpha = 0.05)
control groups combined 249 (246 + 247) vs test group 248

Event: CartScreenAppear:
    p-value:  0.18175875284404386
    Fail to reject the null hypothesis for CartScreenAppear, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      50.11%   48.48%

Event: MainScreenAppear:
    p-value:  0.29424526837179577
    Fail to reject the null hypothesis for MainScreenAppear, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      98.58%   98.27%

Event: OffersScreenAppear:
    p-value:  0.43425549655188256
    Fail to reject the null hypothesis for OffersScreenAppear, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      61.28%   60.35%

Event: PaymentScreenSuccessful:
    p-value:  0.6004294282308704
    Fail to reject the null hypothesis for PaymentScreenSuccessful, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      47.19%   46.55%

Event: Tutorial:
    p-value:  0.764862472531507
    Fail to reject the null hypothesis for Tutorial, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      11.23%   11.00%
Combined Control Group 249 (246 + 247) vs Test Group 248 Conclusion
5.c What significance level have you set to test the statistical hypotheses mentioned above?
5.c.1 Calculate how many statistical hypothesis tests you carried out.
5.c.2 With a statistical significance level of 0.1, one in 10 results could be false. What should the significance level be?
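The count behind `num_tests = 15` can be made explicit. This is a sketch under the assumption that only the three control-vs-test comparisons are counted; including the 246 vs 247 A/A check would add five more tests:

```python
# count of hypothesis tests (assumption: A/A check excluded)
n_events = 5         # CartScreenAppear, MainScreenAppear, OffersScreenAppear,
                     # PaymentScreenSuccessful, Tutorial
n_comparisons = 3    # 246 vs 248, 247 vs 248, 249 vs 248
num_tests = n_events * n_comparisons
print(num_tests)  # 15
```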
# calculate the FWER
desired_alpha = 0.05
num_tests = 15
fwer = 1 - (1 - desired_alpha) ** num_tests
print('FWER for these experiments is {}.'.format(fwer))
print('This means there is a {:.2%} probability of at least one false positive result out of {} tests.'.format(fwer, num_tests))
FWER for these experiments is 0.536708769840247. This means there is a 53.67% probability of at least one false positive result out of 15 tests.
# calculate the Bonferroni correction
# (note: this mistakenly reuses the FWER as the corrected alpha; see the corrected calculation in "Updates to the summary above")
bonferonni_correction = fwer
print('Bonferroni correction: {}'.format(bonferonni_correction))
Bonferroni correction: 0.536708769840247
# rerun all A/B test with Bonferroni correction
# test control group vs test group
print('Bonferroni corrected control group 246 vs test group 248\n')
test_hypothesis(exp_a = '246', exp_b = '248', alpha = bonferonni_correction)
# test control group vs test group
print('Bonferroni corrected control group 247 vs test group 248\n')
test_hypothesis(exp_a = '247', exp_b = '248', alpha = bonferonni_correction)
# test combined control group vs test group
print('Bonferroni corrected control groups combined 249 (246 + 247) vs test group 248\n')
test_hypothesis(exp_a = '249', exp_b = '248', alpha = bonferonni_correction)
Bonferroni corrected control group 246 vs test group 248

Event: CartScreenAppear:
    p-value:  0.07842923237520116
    Reject the null hypothesis for CartScreenAppear, there is a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      50.97%   48.48%

Event: MainScreenAppear:
    p-value:  0.2949721933554552
    Reject the null hypothesis for MainScreenAppear, there is a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      98.63%   98.27%

Event: OffersScreenAppear:
    p-value:  0.20836205402738917
    Reject the null hypothesis for OffersScreenAppear, there is a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      62.08%   60.35%

Event: PaymentScreenSuccessful:
    p-value:  0.2122553275697796
    Reject the null hypothesis for PaymentScreenSuccessful, there is a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      48.31%   46.55%

Event: Tutorial:
    p-value:  0.8264294010087645
    Fail to reject the null hypothesis for Tutorial, there is not a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      11.19%   11.00%

Bonferroni corrected control group 247 vs test group 248

Event: CartScreenAppear:
    p-value:  0.5786197879539783
    Fail to reject the null hypothesis for CartScreenAppear, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      49.26%   48.48%

Event: MainScreenAppear:
    p-value:  0.4587053616621515
    Reject the null hypothesis for MainScreenAppear, there is a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      98.53%   98.27%

Event: OffersScreenAppear:
    p-value:  0.9197817830592261
    Fail to reject the null hypothesis for OffersScreenAppear, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      60.49%   60.35%

Event: PaymentScreenSuccessful:
    p-value:  0.7373415053803964
    Fail to reject the null hypothesis for PaymentScreenSuccessful, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      46.08%   46.55%

Event: Tutorial:
    p-value:  0.765323922474501
    Fail to reject the null hypothesis for Tutorial, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      11.26%   11.00%

Bonferroni corrected control groups combined 249 (246 + 247) vs test group 248

Event: CartScreenAppear:
    p-value:  0.18175875284404386
    Reject the null hypothesis for CartScreenAppear, there is a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      50.11%   48.48%

Event: MainScreenAppear:
    p-value:  0.29424526837179577
    Reject the null hypothesis for MainScreenAppear, there is a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      98.58%   98.27%

Event: OffersScreenAppear:
    p-value:  0.43425549655188256
    Reject the null hypothesis for OffersScreenAppear, there is a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      61.28%   60.35%

Event: PaymentScreenSuccessful:
    p-value:  0.6004294282308704
    Fail to reject the null hypothesis for PaymentScreenSuccessful, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      47.19%   46.55%

Event: Tutorial:
    p-value:  0.764862472531507
    Fail to reject the null hypothesis for Tutorial, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      11.23%   11.00%
Four groups of tests were performed:
A function was created to calculate the number and share of users per group per event and to test whether the proportions were significantly different.
An extra test compared the two control groups with the Bonferroni correction applied:
# Bonferroni correction test: control group vs control group
print('Bonferroni corrected control group 246 vs control group 247\n')
test_hypothesis(exp_a = '246', exp_b = '247', alpha = bonferonni_correction)
Bonferroni corrected control group 246 vs control group 247

Event: CartScreenAppear:
    p-value:  0.22883372237997213
    Reject the null hypothesis for CartScreenAppear, there is a significant difference between the proportions of experiments 246 and 247.
    Experiment:          246      247
    Number of users:     2484     2513
    Share of users:      50.97%   49.26%

Event: MainScreenAppear:
    p-value:  0.7570597232046099
    Fail to reject the null hypothesis for MainScreenAppear, there is not a significant difference between the proportions of experiments 246 and 247.
    Experiment:          246      247
    Number of users:     2484     2513
    Share of users:      98.63%   98.53%

Event: OffersScreenAppear:
    p-value:  0.2480954578522181
    Reject the null hypothesis for OffersScreenAppear, there is a significant difference between the proportions of experiments 246 and 247.
    Experiment:          246      247
    Number of users:     2484     2513
    Share of users:      62.08%   60.49%

Event: PaymentScreenSuccessful:
    p-value:  0.11456679313141849
    Reject the null hypothesis for PaymentScreenSuccessful, there is a significant difference between the proportions of experiments 246 and 247.
    Experiment:          246      247
    Number of users:     2484     2513
    Share of users:      48.31%   46.08%

Event: Tutorial:
    p-value:  0.9376996189257114
    Fail to reject the null hypothesis for Tutorial, there is not a significant difference between the proportions of experiments 246 and 247.
    Experiment:          246      247
    Number of users:     2484     2513
    Share of users:      11.19%   11.26%
General Information
Sales/Event Funnel
A/A/B test on font styles
Initially these tests were performed without the Bonferroni correction.
The Bonferroni correction was then calculated:
General Conclusion
Recommendations
Updates to the summary above
# performed post program completion
# corrected Bonferroni correction, after learning the right approach in project 12.1
correct_bon_corr = .05 / 15  # desired alpha / number of tests
print('correct Bonferroni correction: {}'.format(correct_bon_corr))
correct Bonferroni correction: 0.0033333333333333335
# performed post program completion
# calculate the FWER
correct_fwer = 1 - (1 - correct_bon_corr) ** num_tests
print('FWER for these experiments is {} when each test uses the Bonferroni-corrected alpha.'.format(correct_fwer))
print('This means there is a {:.2%} probability of at least one false positive result out of {} tests.'.format(correct_fwer, num_tests))
FWER for these experiments is 0.04885001789563237 when each test uses the Bonferroni-corrected alpha. This means there is a 4.89% probability of at least one false positive result out of 15 tests.
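As a side note, the Bonferroni correction (alpha / m) is slightly conservative; the Šidák correction, 1 - (1 - alpha)^(1/m), makes the FWER come out to exactly alpha for independent tests. This sketch (not part of the original analysis) compares the three per-test alphas:

```python
# FWER for m independent tests under different per-test alphas (illustrative sketch)
m = 15
alpha = 0.05

fwer_uncorrected = 1 - (1 - alpha) ** m            # ~0.537: at least one false positive is likely
alpha_bonferroni = alpha / m                       # conservative: guarantees FWER <= alpha
alpha_sidak = 1 - (1 - alpha) ** (1 / m)           # exact for independent tests

fwer_bonferroni = 1 - (1 - alpha_bonferroni) ** m  # ~0.0489, just under alpha
fwer_sidak = 1 - (1 - alpha_sidak) ** m            # equals alpha exactly

print(fwer_uncorrected, fwer_bonferroni, fwer_sidak)
```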
# performed post program completion
# rerun all A/B test with CORRECT Bonferroni correction
# test control group vs test group
print('Corrected: Bonferroni corrected control group 246 vs test group 248\n')
test_hypothesis(exp_a = '246', exp_b = '248', alpha = correct_bon_corr)
# test control group vs test group
print('Corrected: Bonferroni corrected control group 247 vs test group 248\n')
test_hypothesis(exp_a = '247', exp_b = '248', alpha = correct_bon_corr)
# test combined control group vs test group
print('Corrected: Bonferroni corrected control groups combined 249 (246 + 247) vs test group 248\n')
test_hypothesis(exp_a = '249', exp_b = '248', alpha = correct_bon_corr)
Corrected: Bonferroni corrected control group 246 vs test group 248

Event: CartScreenAppear:
    p-value:  0.07842923237520116
    Fail to reject the null hypothesis for CartScreenAppear, there is not a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      50.97%   48.48%

Event: MainScreenAppear:
    p-value:  0.2949721933554552
    Fail to reject the null hypothesis for MainScreenAppear, there is not a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      98.63%   98.27%

Event: OffersScreenAppear:
    p-value:  0.20836205402738917
    Fail to reject the null hypothesis for OffersScreenAppear, there is not a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      62.08%   60.35%

Event: PaymentScreenSuccessful:
    p-value:  0.2122553275697796
    Fail to reject the null hypothesis for PaymentScreenSuccessful, there is not a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      48.31%   46.55%

Event: Tutorial:
    p-value:  0.8264294010087645
    Fail to reject the null hypothesis for Tutorial, there is not a significant difference between the proportions of experiments 246 and 248.
    Experiment:          246      248
    Number of users:     2484     2537
    Share of users:      11.19%   11.00%

Corrected: Bonferroni corrected control group 247 vs test group 248

Event: CartScreenAppear:
    p-value:  0.5786197879539783
    Fail to reject the null hypothesis for CartScreenAppear, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      49.26%   48.48%

Event: MainScreenAppear:
    p-value:  0.4587053616621515
    Fail to reject the null hypothesis for MainScreenAppear, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      98.53%   98.27%

Event: OffersScreenAppear:
    p-value:  0.9197817830592261
    Fail to reject the null hypothesis for OffersScreenAppear, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      60.49%   60.35%

Event: PaymentScreenSuccessful:
    p-value:  0.7373415053803964
    Fail to reject the null hypothesis for PaymentScreenSuccessful, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      46.08%   46.55%

Event: Tutorial:
    p-value:  0.765323922474501
    Fail to reject the null hypothesis for Tutorial, there is not a significant difference between the proportions of experiments 247 and 248.
    Experiment:          247      248
    Number of users:     2513     2537
    Share of users:      11.26%   11.00%

Corrected: Bonferroni corrected control groups combined 249 (246 + 247) vs test group 248

Event: CartScreenAppear:
    p-value:  0.18175875284404386
    Fail to reject the null hypothesis for CartScreenAppear, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      50.11%   48.48%

Event: MainScreenAppear:
    p-value:  0.29424526837179577
    Fail to reject the null hypothesis for MainScreenAppear, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      98.58%   98.27%

Event: OffersScreenAppear:
    p-value:  0.43425549655188256
    Fail to reject the null hypothesis for OffersScreenAppear, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      61.28%   60.35%

Event: PaymentScreenSuccessful:
    p-value:  0.6004294282308704
    Fail to reject the null hypothesis for PaymentScreenSuccessful, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      47.19%   46.55%

Event: Tutorial:
    p-value:  0.764862472531507
    Fail to reject the null hypothesis for Tutorial, there is not a significant difference between the proportions of experiments 249 and 248.
    Experiment:          249      248
    Number of users:     4997     2537
    Share of users:      11.23%   11.00%